Abstract

Our work is motivated by Bourque and Pevzner's (2002) simulation study of the effectiveness
of the parsimony method in studying genome rearrangement, and leads to a surprising result
about the random transposition walk on the group of permutations on $n$ elements. Consider
this walk in continuous time starting at the identity and let $D_t$ be the minimum number of
transpositions needed to go back to the identity from the location at time $t$. $D_t$ undergoes a phase
transition: the distance $D_{cn/2} \sim u(c)n$, where $u$ is an explicit function satisfying $u(c) = c/2$ for
$c \le 1$ and $u(c) < c/2$ for $c > 1$. In addition, we describe the fluctuations of $D_{cn/2}$ about its
mean in each of the three regimes (subcritical, critical and supercritical). The techniques used
involve viewing the cycles in the random permutation as a coagulation-fragmentation process
and relating the behavior to the Erdős-Rényi random graph model.
1 General motivation



The relationship between the orders of genes in two species can be described by a signed permutation.
For example, the relationship between the human and mouse X chromosomes may be encoded
as (see Pevzner and Tesler (2003))

1 -7 6 -10 9 -8 2 -11 -3 5 4 

In words the two X chromosomes can be partitioned into 11 segments. The first segment of the 
mouse X chromosome is the same as that of humans, the second segment of mouse is the 7th human 
segment with its orientation reversed, etc. The parsimony approach to estimation of evolutionary 
changes of the X chromosome between human and mouse is to ask: what is the minimum number 
of reversals (i.e., moves that reverse the order of a segment and therefore change its sign) needed 
to transform the arrangement above back into 1, ..., 11? In other words, what is the (reversal)
distance between the human and mouse X chromosomes?

Hannenhalli and Pevzner (1995) developed a polynomial algorithm for answering this question.
The first step in preparing to use the Hannenhalli-Pevzner algorithm is to double the markers. When
segment $i$ is doubled we replace it by two consecutive numbers $2i - 1$ and $2i$, e.g., 6 becomes 11 and
12. A reversed segment $-i$ is replaced by $2i$ and $2i - 1$, for example, $-7$ is replaced by 14 and 13.
The doubled markers use up the integers 1 to 22. To these numbers we add a 0 at the front and a
23 at the end. Using commas to separate the ends of the markers we can write the two genomes
as follows:

mouse 0, 1 2, 14 13, 11 12, 20 19, 17 18, 16 15, 3 4, 22 21, 6 5, 9 10, 7 8, 23 
human 0, 1 2, 3 4, 5 6, 7 8, 9 10, 11 12, 13 14, 15 16, 17 18, 19 20, 21 22, 23

The next step is to construct the breakpoint graph (see Figure 1) that results when the commas are 
replaced by edges that connect vertices with the corresponding numbers. In the picture we have 
written the vertices in their order in the mouse genome. Commas in the mouse order become thick 
lines (black edges), while those in the human genome are thin lines (gray edges). 

Each vertex has one black and one gray edge, so the connected components of the graph are 
easy to find: start with a vertex and follow the connections in either direction until you come back 
to where you start. In this example there are five components: 

0-1-0    2-14-15-3-2    4-22-23-8-9-5-4
19-17-16-18-19    13-11-10-7-6-21-20-12-13

To compute a lower bound for the distance, we take the number of commas seen when we write
out one genome. In this example that is 12. In general, it is 1 plus the number of markers. We then
subtract the number of components in the breakpoint graph. In this example that is 5, so the result
is 7. This is a lower bound on the distance, since any reversal can reduce this quantity by at most
1, and it is 0 when the two genomes are the same. We can verify that 7 is the minimum distance
by constructing a sequence of 7 moves that transforms the mouse X chromosome into the human
order. There are thousands of solutions, so we leave this as an exercise for the reader. Here are
some hints: (i) To do this it suffices, at each step, to choose a reversal that increases the number
of cycles by 1. (ii) This never occurs if the two chosen black edges are in different cycles. (iii) If
the two black edges are in the same cycle and are (a, b) and (c, d) as we read from left to right, this
will occur unless in the cycle minus these two edges a is connected to d and b to c, in which case
the number of cycles will not change. For example, in the graph in Figure 1 a reversal that breaks
black edges 19-17 and 18-16 will increase the number of cycles but the one that breaks 2-14 and
15-3 will not.
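
The doubling step and the component count described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `breakpoint_components` and `lower_bound` are hypothetical, and the code computes only the lower bound (number of commas minus number of components), not the hurdle or fortress corrections.

```python
def breakpoint_components(signed_perm):
    """Count connected components of the breakpoint graph of a signed permutation."""
    n = len(signed_perm)
    # Double the markers: +i -> (2i-1, 2i), -i -> (2i, 2i-1); frame with 0 and 2n+1.
    doubled = [0]
    for s in signed_perm:
        i = abs(s)
        doubled += [2 * i - 1, 2 * i] if s > 0 else [2 * i, 2 * i - 1]
    doubled.append(2 * n + 1)

    # Black edges: the two numbers on either side of each comma in this genome.
    black = {}
    for pos in range(0, len(doubled), 2):
        a, b = doubled[pos], doubled[pos + 1]
        black[a], black[b] = b, a
    # Gray edges: the same for the identity genome, i.e. 0-1, 2-3, ..., 2n-(2n+1).
    gray = {v: v ^ 1 for v in range(2 * n + 2)}

    # Every vertex has one black and one gray edge; explore each component once.
    seen, components = set(), 0
    for start in range(2 * n + 2):
        if start in seen:
            continue
        components += 1
        stack = [start]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend((black[v], gray[v]))
    return components


def lower_bound(signed_perm):
    """Number of commas (n + 1) minus the number of components."""
    return len(signed_perm) + 1 - breakpoint_components(signed_perm)


mouse = [1, -7, 6, -10, 9, -8, 2, -11, -3, 5, 4]
print(breakpoint_components(mouse), lower_bound(mouse))  # 5 components, bound 7
```

For the mouse arrangement this reproduces the 5 components and the lower bound 7 computed in the text, and it returns 0 for the identity arrangement.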

In general, the distance between genomes can be larger than the lower bound from the breakpoint
graph. There can be obstructions called hurdles that can prevent us from decreasing the
distance, and hurdles can be intertwined in a fortress of hurdles that takes an extra move to break.
See Hannenhalli and Pevzner (1995). In symbols, if $\pi$ is the signed permutation that represents the
relative order and orientation of segments in the two genomes, then

$$d(\pi) = n + 1 - c(\pi) + h(\pi) + f(\pi)$$

where $d(\pi)$ is the distance from the identity, $n$ is the number of markers, $c(\pi)$ is the number of
components in the breakpoint graph, $h(\pi)$ is the number of hurdles, and $f(\pi)$ is the indicator of
the event that $\pi$ is a fortress of hurdles. See Section 5.2 of Durrett (2002) or Chapter 10 of Pevzner
(2000) for more details.

Although $d_0(\pi) = n + 1 - c(\pi)$ is only a lower bound on the distance, it is the right answer
in most biological examples. Bafna and Pevzner (1995) considered 11 comparisons of mitochondrial
and chloroplast genomes and found that this lower bound gave the right answer in all cases. This
pattern has continued in more recent work, see York, Durrett, and Nielsen (2002), and Durrett,
Nielsen, and York (2003). The simulations in Figure 2 will give more evidence that $d_0(\pi)$ and $d(\pi)$
are close in many cases.

To motivate our main question, we will introduce a second data set. Ranz, Casals, and 
Ruiz (2001) located 79 genes on chromosome 2 of D. repleta and on chromosome arm 3R of D. 
melanogaster. If we number the genes according to their order in D. repleta then their order in D. 
melanogaster is given in Table 1. This time we do not know the orientation of the segments, but 
that is not a serious problem. Using simulated annealing, one can easily find an assignment of signs 
that minimizes the distance, which in this case is 54. Given the large number of rearrangements 
relative to the number of markers, we should ask: when is the parsimony estimate reliable? 

Bourque and Pevzner (2002) approached this question by taking 100 markers in order, performing
$k$ randomly chosen reversals to get a permutation $\pi_k$, computing the minimum number of
reversals needed to return to the identity, $d(\pi_k)$, and then plotting the average value of $d(\pi_k) - k \le 0$
for 100 simulations. They concluded, based on their simulations, that the parsimony distance for
$n$ markers was a good estimate as long as the number of reversals performed was at most $0.4n$.
In Figure 2 we have given $-1$ times their data. We have also repeated their experiment for the
approximate distance $d_0(\pi) = n + 1 - c(\pi)$ and plotted the average value of $k - d_0(\pi_k) \ge 0$ for
10,000 replications. Our curve is less random, but close to the data of Bourque and Pevzner (2002).
The smooth curve gives the result of Theorem 3 for the limiting behavior of $(tn - d_0(\sigma_{tn}))/n$ (as a
function of $t$).

The biological question concerns the random reversal walk. However, it is also interesting to 
consider the analogous problem for random transpositions. In that case the distance from the 
identity can be easily computed: it is the number of markers n minus the number of cycles in the 
permutation. For an example, consider the following permutation of 14 objects written in its cyclic 
decomposition: 

(1 7 4) (2) (3 12) (5 13 9 11 6) (8 10 14)

which indicates that 1 → 7, 7 → 4, 4 → 1, 2 → 2, 3 → 12, 12 → 3, etc. There are 5 cycles so
the distance from the identity is 9. If we perform a transposition that includes markers from two
different cycles (e.g., 7 and 9) the two cycles merge into one, while if we pick two in the same cycle
(e.g., 13 and 11) it splits into two.
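
The merge and split behavior in this example can be checked directly. The following is a minimal sketch with hypothetical helper names (`cycles_to_map`, `count_cycles`, `transpose`); it applies a transposition by exchanging the images of the two markers, which is one of the two equivalent conventions.

```python
def cycles_to_map(cycles, n):
    """Build the map i -> sigma(i) on {1, ..., n} from a list of cycles."""
    sigma = {i: i for i in range(1, n + 1)}
    for cyc in cycles:
        for a, b in zip(cyc, cyc[1:] + cyc[:1]):
            sigma[a] = b
    return sigma


def count_cycles(sigma):
    """Number of cycles (fixed points included) of the permutation sigma."""
    seen, count = set(), 0
    for start in sigma:
        if start in seen:
            continue
        count += 1
        v = start
        while v not in seen:
            seen.add(v)
            v = sigma[v]
    return count


def transpose(sigma, i, j):
    """Return a copy of sigma with the images of i and j exchanged."""
    out = dict(sigma)
    out[i], out[j] = sigma[j], sigma[i]
    return out


n = 14
sigma = cycles_to_map([[1, 7, 4], [2], [3, 12], [5, 13, 9, 11, 6], [8, 10, 14]], n)
print(n - count_cycles(sigma))                     # distance 9
print(n - count_cycles(transpose(sigma, 7, 9)))    # merge: distance 10
print(n - count_cycles(transpose(sigma, 13, 11)))  # split: distance 8
```

Transposing 7 and 9 joins the cycles (1 7 4) and (5 13 9 11 6) into a single cycle of length 8, while transposing 13 and 11 breaks (5 13 9 11 6) into a 3-cycle and a 2-cycle.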

The situation is similar but slightly more complicated for reversals. There a reversal that
involves edges in two different components merges them into one, but a reversal that involves two
edges of the same cycle may or may not increase the number of cycles. One can attempt to couple
the components of the breakpoint graph for random reversals on $n - 1$ markers and the cycles of
random transpositions of $n$ markers as follows: number the edges between markers in the reversal
chain (including the ends 0 and $n$); when markers $i$ and $j$ are transposed, do the inversion of edges
numbered $i$ and $j$. The result of the coupled simulation is given in Figure 2. As expected, time
minus distance is smaller for reversals but the qualitative behavior is similar. Thus, we will begin
by considering the biologically less relevant case of random transpositions, and ask a question that
in terms of the rate 1 continuous time random walk on the symmetric group is: how far from the
identity are we at time $cn/2$? We will see later that parts of the answer can be extended to the
reversal random walk.

2 The coagulation-fragmentation process and the random graph process

Let $(\sigma_t, t \ge 0)$ be the continuous-time random walk on the group of permutations, starting at the
identity, in which, at times of a rate one Poisson process, we perform a transposition of two elements
chosen uniformly at random, with replacement, from $\{1, \ldots, n\}$. Choosing with replacement causes
the chain to do nothing with probability $1/n$, but makes some of the calculations a little nicer. If we
think of the permutation $\sigma$ as being represented by numbered balls sitting on numbered locations
with ball $\sigma(k)$ sitting at $k$, then transposition of $i$ and $j$, $\rho_{ij}$, can be implemented in two ways. We
can exchange the balls at $i$ and $j$ or the balls numbered $i$ and $j$. Algebraically these correspond
to $\rho_{ij}\sigma$ and $\sigma\rho_{ij}$. Since $(\sigma\rho_{ij})^{-1} = \rho_{ij}\sigma^{-1}$ and the partitions of $\{1, \ldots, n\}$ induced by the cycle
decompositions of $\sigma$ and $\sigma^{-1}$ are equal, the results are the same for either random walk.

Define the distance to the identity $D_t$ to be the minimum number of transpositions one needs to
perform on $\sigma_t$ to go back to the identity element. A different way of looking at $D_t$ is the following:
$(\sigma_t, t \ge 0)$ can be viewed as a random walk on a graph $G$, where $G$ is the Cayley graph of the
symmetric group for the set of generators given by the set of all transpositions. Using this language,
we see that $D_t$ is nothing but the graph distance from $\sigma_t$ to the origin, the identity element.

It is clear that if $N_t$ is the number of transpositions distinct from the identity performed up
to time $t$ (a Poisson random variable with mean $t(1 - 1/n)$, since each step does nothing with
probability $1/n$), then $D_t \le N_t$. As mentioned
earlier $D_t$ is given by $D_t = n - |\sigma_t|$, where $|\sigma_t|$ is the number of cycles in the cycle decomposition
of $\sigma_t$. This formula allows us to turn any question about $D_t$ into a question about $|\sigma_t|$. The key
to studying $|\sigma_t|$ is that the cycles evolve according to the dynamics of a coagulation-fragmentation
process. When a transposition $\rho_{ij}$ occurs, if $i$ and $j$ belong to two different cycles then the cycles
merge. If instead they belong to the same cycle, this cycle is split into two cycles. From
the definition it can be seen that the ranked sizes of the cycles form a coagulation-fragmentation
process (see Aldous (1999) and Pitman (2002), (2003)) in which components of size $x$ and $y$ merge
at rate $K_n(x, y) = 2xy/n^2$ and components of size $x$ split at rate $F_n(x) = x(x - 1)/n^2$ and are
broken at a uniformly chosen random point. Diaconis, Mayer-Wolf, Zeitouni, and Zerner (2003)
have recently considered the corresponding Markov chain on partitions of the unit interval and
shown that the Poisson-Dirichlet distribution is the unique invariant measure.

To study the evolution of the cycles in the random permutation, we construct a random graph
process. Start with the initial graph on vertices $\{1, \ldots, n\}$ with no edge between the vertices. When
a transposition of $i$ and $j$ occurs in the random walk, draw an edge between the vertices $i$ and $j$.
To take care of the rare event that a given transposition is chosen several times, we will allow the
possibility of multiple edges, and draw a second edge if one is already present. It is easy to see that
in our continuous time process, at time $t$ this graph is a realization of the Erdős-Rényi random
graph $G(n, p)$, in which edges are independently present with probability $p = 1 - \exp(-2t/n^2)$, see
Bollobás (1985) or Janson, Łuczak, and Ruciński (2000).^1 It is also easy to see that in order for
two integers to be in the same cycle in the permutation it is necessary that they are in the same
component of the random graph.

To estimate the difference between cycles and components, let $F_t$ denote the event that a
fragmentation occurs at time $t$. It is clear that

$$D_t = N_t - 2\sum_{s \le t} 1_{F_s} \qquad (1)$$

A fragmentation occurs in the random permutation when a transposition occurs between two 
integers in the same cycle, so tree components in the random graph correspond to unfragmented 
cycles in the random walk. (Here and in all that follows, "tree" has a multi-graph meaning: it
is a connected component with no nontrivial closed circuit.) Unicyclic components (with an equal 
number of vertices and edges) correspond to cycles in the permutation that have experienced exactly 
one fragmentation, but we need to know the order in which the edges were added to determine the 
resulting cycles. For more complex components, the relationship between the random graph and 
the permutation is less clear. Fortunately, these can be ignored in the proofs of our results. 
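
Identity (1) is easy to confirm by simulation. The sketch below is an illustration under simplifying assumptions: it runs the embedded discrete-time chain for a fixed number of steps rather than the continuous-time walk, markers are numbered from 0, and the function names (`run_walk`, `same_cycle`, `count_cycles`) are hypothetical.

```python
import random


def count_cycles(sigma):
    """Number of cycles of a permutation given as a list (sigma[i] is the image of i)."""
    seen, cycles = [False] * len(sigma), 0
    for start in range(len(sigma)):
        if seen[start]:
            continue
        cycles += 1
        v = start
        while not seen[v]:
            seen[v] = True
            v = sigma[v]
    return cycles


def same_cycle(sigma, i, j):
    """True if j lies on the cycle of sigma that contains i."""
    v = sigma[i]
    while v != i:
        if v == j:
            return True
        v = sigma[v]
    return False


def run_walk(n, steps, seed=0):
    """Random transposition walk; returns (moves, fragmentations, distance)."""
    rng = random.Random(seed)
    sigma = list(range(n))
    moves = fragmentations = 0
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)  # chosen with replacement
        if i == j:
            continue  # the identity: nothing happens
        if same_cycle(sigma, i, j):
            fragmentations += 1  # the cycle containing i and j splits
        sigma[i], sigma[j] = sigma[j], sigma[i]
        moves += 1
    return moves, fragmentations, n - count_cycles(sigma)


moves, frags, dist = run_walk(n=200, steps=300, seed=1)
print(moves, frags, dist, dist == moves - 2 * frags)
```

Each coagulation raises the distance by 1 and each fragmentation lowers it by 1, so the final check holds on every run regardless of the seed.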

3 Limit Theorems 

We will now describe our results and sketch their proofs. Rigorous proofs of the results stated in 
this section can be found in sections 4, 5 and 6. 

3.1 The subcritical regime 

Theorem 1. Let $0 < c < 1$. The number of fragmentations

$$Z_c := \sum_{s \le cn/2} 1_{F_s} \Rightarrow \mathrm{Poisson}(\kappa(c)) \qquad (2)$$

where $\kappa(c) = (-\log(1 - c) - c)/2$. In fact, the convergence holds for the process $\{Z_c : 0 \le c < 1\}$
with the limit being a Poisson process with compensator $\kappa(c)$.

^1 The fact that we allow multiple edges makes no difference. At each point where the distinction with the usual
Erdős-Rényi random graph may be relevant, a very simple calculation shows that the effect of multi-edges can be
neglected (see Janson et al. (1993), where this issue is discussed). To make the core of our arguments simpler to
follow, we will ignore this distinction from now on.
Remark. The result for fluctuations is formulated in terms of fragmentations rather than the
distance, since $D_{cn/2} - cn/2 \approx N_{cn/2} - cn/2 = O(n^{1/2})$. For the embedded discrete time chain, if
$k = \lfloor cn/2 \rfloor$, then

$$(k - D_k)/2 \Rightarrow \mathrm{Poisson}(\kappa(c)) \quad \text{as } n \to \infty \qquad (3)$$

We divide by 2 since a fragmentation reduces the distance by 1 instead of increasing it by 1. To
deduce (3) from (2) we note that time $k$ in the discrete walk corresponds to time $N^{-1}(k) \approx cn/2$
in the continuous time walk.

Sketch of the proof. The process $\{Z_c, 0 \le c < 1\}$ is a càdlàg counting process. Therefore by
arguments from Jacod and Shiryaev (1987), it is enough to show that its compensator $\kappa_n(c)$ converges
to the deterministic limit $\kappa(c)$. If $f_k(t)$ is the fraction of vertices that belong to cycles of size $k$, the
rate at which fragmentations occur is just $\sum_k f_k(t)(k - 1)/n$. Hence $\kappa_n(c)$ is just the integral with
respect to time of this rate. We first show that the variance converges to 0 and then, by Chebyshev's
inequality, it only remains to show $E\kappa_n(c) \to \kappa(c)$. But by exchangeability $E f_k(t) = P[|C_1| = k]$,
where $|C_1|$ is the size of the component that contains 1 at time $t$. It is not hard to see that this
quantity at time $bn/2$ converges in distribution to the total progeny $\tau$ of a Galton-Watson branching
process with offspring distribution Poisson($b$), or PGW($b$). Summing the geometric series, we see
that $E\tau = 1/(1 - b)$. Integrating with respect to $b$ we get the desired expected value, $\kappa(c)$. □
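
The final integration can be written out as follows. This is a heuristic sketch only: it writes $\kappa_n(c)$ for the compensator (notation assumed here), replaces $E|C_1|$ at time $t = bn/2$ by its limit $1/(1 - b)$, and uses the substitution $dt = (n/2)\,db$.

```latex
\begin{aligned}
E\kappa_n(c)
  &= \int_0^{cn/2} \frac{E|C_1(t)| - 1}{n}\, dt
   \approx \frac{1}{2}\int_0^{c} \left( \frac{1}{1-b} - 1 \right) db \\
  &= \frac{1}{2}\Big[ -\log(1-b) - b \Big]_0^{c}
   = \frac{-\log(1-c) - c}{2} = \kappa(c).
\end{aligned}
```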

To prepare for later developments, it is useful to take a second combinatorial approach to this
result. We begin with Cayley's result that there are $k^{k-2}$ trees with $k$ labeled vertices. At time
$cn/2$ each edge is present with probability $1 - \exp(-c/n) \sim c/n$ so the expected number of trees
of size $k$ present is

$$\binom{n}{k} k^{k-2} \left(\frac{c}{n}\right)^{k-1} \left(1 - \frac{c}{n}\right)^{k(n-k) + \binom{k}{2} - k + 1}$$

since each of the $k - 1$ edges needs to be present and there can be no edges connecting the $k$
point set to its complement or any other edges connecting the $k$ points. For fixed $k$ the above is
asymptotic to

$$n \frac{k^{k-2}}{k!} c^{k-1} \left(1 - \frac{c}{n}\right)^{kn}$$

The quantity in parentheses at the end, raised to the power $kn$, converges to $e^{-ck}$ so we have an asymptotic formula for the
number of tree components at time $cn/2$. As a side result we get the following known result:



Corollary 1. The probability distribution of the total progeny $T$ of a Poisson($c$) branching process
with $c < 1$ is given by $P(T = k) = \frac{1}{c} \frac{k^{k-1}}{k!} (c e^{-c})^k$.



See Section 4.1 of Pitman (1999) for another proof of this result. It was first discovered by
Borel (1942) and the distribution of $T$ is called the Borel distribution. It is a particular case of the
so-called Borel-Tanner distribution, see Devroye (1992) and Pitman (1998) for further references.
In this context it appeared in the problem of the total number of units served in the first busy
period of a queue with Poisson arrivals and constant service times. See also Tanner (1961). Of
course, this becomes a branching process if we think of the customers that arrive during a person's
service time as their children.
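
As a numerical sanity check on Corollary 1, one can confirm that the Borel probabilities sum to 1 and have mean $1/(1 - c)$ when $c < 1$. The sketch below is illustrative only; `borel_pmf` is a hypothetical helper that evaluates the formula in log space to avoid overflow in $k^{k-1}/k!$, and the truncation at 2000 terms is an assumption that is harmless for the value of $c$ used.

```python
import math


def borel_pmf(k, c):
    """P(T = k) = (1/c) * k^(k-1)/k! * (c e^{-c})^k, evaluated via logarithms."""
    log_p = (-math.log(c) + (k - 1) * math.log(k)
             - math.lgamma(k + 1) + k * (math.log(c) - c))
    return math.exp(log_p)


c = 0.5
ks = range(1, 2001)  # terms decay geometrically for c < 1
total = sum(borel_pmf(k, c) for k in ks)
mean = sum(k * borel_pmf(k, c) for k in ks)
print(total, mean, 1 / (1 - c))
```

The mean agrees with the value $E\tau = 1/(1 - b)$ used in the sketch of the proof of Theorem 1.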
3.2 The critical regime



It is well known in the theory of random graphs that the correct time-scale to describe the critical
regime is $(n/2)(1 + \lambda n^{-1/3})$, $\lambda \in (-\infty, \infty)$. See Aldous (1997) for an interesting account that
relates the growth of large clusters in the critical random graph to the multiplicative coalescent. At
times $(n/2)(1 - n^{-r})$ with $r < 1/3$, we are still in the subcritical regime, so the arguments in the
proof of Theorem 1, when done more carefully, are still valid. More precisely, we can show that if
$c_n(r) = 1 - n^{-r/3}$ for $0 \le r \le 1$, then the expected number of fragmentations up to time $c_n(r)n/2$
is again given by $\kappa(c_n(r)) \sim (r/6)\log n$. Hence define:

$$W_n(r) = \left(\tfrac{1}{6}\log n\right)^{-1/2} \left( \sum_{s \le c_n(r)n/2} 1_{F_s} - \frac{r}{6}\log n \right) \qquad (4)$$

Theorem 2. As $n \to \infty$, $W_n(\cdot)$ converges weakly, with respect to the Skorokhod topology on the
space of càdlàg functions on $[0, 1]$, to $\{W(r), 0 \le r \le 1\}$, a standard Brownian motion on $[0, 1]$.
Furthermore,

$$\left(\tfrac{1}{6}\log n\right)^{-1/2} \left( \sum_{s \le n/2} 1_{F_s} - \frac{1}{6}\log n \right) \Rightarrow W(1). \qquad (5)$$

Sketch of the proof. Intuitively, the first result is an immediate consequence of the Poisson limit
in Theorem 1 and the normal approximation to the Poisson. To prove it, we show that $W_n(r)$
is a martingale, whose jumps are asymptotically zero, and whose quadratic variation process is $r$
thanks to our time-change $c_n(r) = 1 - n^{-r/3}$. Therefore it converges to Brownian motion.

At times $(1 - n^{-1/3})n/2 \le t \le n/2$ we are in the critical range of the random graph. Results
of Łuczak, Pittel, and Wierman (1994) and computations with (1) imply that the number of
fragmentations in this interval is bounded in expectation and hence can be ignored. □

Remark. While Theorem 2 is a nice theoretical result, it does not have much to say about any
biological example. If we think of the human genome and set $n$ = 3 billion nucleotides, Theorem
2 says that after $n/2$ = 1.5 billion transpositions there have been an average of $(\log n)/6 = 3.63$
fragmentations, with a standard deviation of 1.91. These numbers are small so even for $n$ = 3
billion, we can't expect a very good approximation to the normal distribution. In the example that
we simulated, $n = 100$ and $(\log n)/6 = 0.767$ versus an observed average number of fragmentations
of 0.662 (which translates into a value of 1.224 in Figure 2). While our estimation of the mean is
not very accurate, Figure 3 shows that the distribution of the number of fragmentations is almost
Poisson.
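
The arithmetic in this remark is quick to reproduce. The snippet below is a small illustrative check; `fragmentation_mean` is a hypothetical name for the centering $(\log n)/6$ from Theorem 2, and the standard deviation is taken as its square root, as for a Poisson variable.

```python
import math


def fragmentation_mean(n):
    """Asymptotic mean number of fragmentations by time n/2, i.e. (log n)/6."""
    return math.log(n) / 6.0


for n in (3e9, 100):
    mean = fragmentation_mean(n)
    print(f"n = {n:g}: mean = {mean:.3f}, sd = {math.sqrt(mean):.3f}")
```

For $n$ = 3 billion this gives a mean of about 3.64 (quoted as 3.63 above) and a standard deviation of about 1.91; for $n = 100$ the mean is about 0.767.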



3.3 The supercritical regime 

This is the most interesting case, and also the hardest one. We start by establishing a law of large
numbers. For all $c > 0$ define

$$\beta_k(c) = \frac{1}{c} \frac{k^{k-1}}{k!} (c e^{-c})^k \qquad (6)$$

so that for $c < 1$ it coincides with the Borel distribution of Corollary 1. When $c > 1$,

$$\lim_{n \to \infty} P(|C_1| = k) = \beta_k(c)$$

still holds but the $\beta_k(c)$'s no longer sum up to 1 because there is a probability $\beta_\infty(c) = 1 - \sum_{k \ge 1} \beta_k(c) > 0$
that $C_1$ is the giant component.

Let us denote by $T(c)$ a random variable that takes the value $1/k$ with probability $\beta_k(c)$ when
$1 \le k < \infty$ and the value 0 with probability $\beta_\infty(c)$. The motivation for this definition is that $1/|C_k|$
has asymptotically the same distribution as $T(c)$ and $\sum_{k=1}^{n} 1/|C_k|$ gives the number of components in the random
graph.

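Theorem 3. Let $c > 0$ be a fixed positive number. Then the number of cycles in the random
permutation at time $cn/2$ is $|\sigma_{cn/2}| = g(c)n + \omega(\sqrt{n})$, where

$$g(c) := ET(c) = \frac{1}{c} \sum_{k=1}^{\infty} \frac{k^{k-2}}{k!} (c e^{-c})^k \qquad (7)$$

and the error term satisfies $\omega(\sqrt{n})/(a_n \sqrt{n}) \to 0$ in probability if $a_n \to \infty$.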

Note that the theorem is valid for all regimes and implies that the distance is given by $D_{cn/2} =
u(c)n + \omega(\sqrt{n})$ where $u(c) = 1 - g(c)$. Although it is not obvious from the formula, $u(c) = c/2$
for $c \le 1$ and $u(c) < c/2$ when $c > 1$. Using Stirling's formula, $k! \sim k^k e^{-k} \sqrt{2\pi k}$, it is easy to
check that $g'$ exists for all $c$ and is continuous, but $g''(1)$ does not exist. In words, there is a phase
transition in the behavior of the distance of the random walk to the identity at time $n/2$ from linear
to sublinear.
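
The claim that $u(c) = c/2$ up to $c = 1$ and falls below $c/2$ afterwards can be checked numerically from the series (7). The sketch below is illustrative: the function names `g` and `u` mirror the notation of Theorem 3, the series is truncated at a fixed number of terms (an assumption; convergence is slowest at $c = 1$, where the terms decay like $k^{-5/2}$), and each term is evaluated in log space.

```python
import math


def g(c, terms=20000):
    """g(c) = (1/c) * sum_{k>=1} k^(k-2)/k! * (c e^{-c})^k, truncated at `terms`."""
    log_x = math.log(c) - c
    total = 0.0
    for k in range(1, terms + 1):
        total += math.exp((k - 2) * math.log(k) - math.lgamma(k + 1) + k * log_x)
    return total / c


def u(c):
    """Limiting distance per marker at time cn/2."""
    return 1.0 - g(c)


for c in (0.5, 1.0, 2.0):
    print(c, u(c), c / 2)
```

At $c = 0.5$ and $c = 1$ the two printed columns agree (up to truncation error), while at $c = 2$ the value of $u(c)$ is roughly 0.84, below $c/2 = 1$.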

Proof. In the supercritical regime the dynamics of the large components is quite complicated, but
there can never be more than $\sqrt{n}$ components of size $\sqrt{n}$ or larger. The expected number of
fragmentations that produce clusters of size smaller than $\sqrt{n}$ by time $cn/2$ is at most $n^{-1/2} \cdot cn/2$.
From this and Chebyshev's inequality we see that up to a term $\omega(n^{1/2})$, $|\sigma_{cn/2}|$ is the number of
components of the random graph, and the result follows from Theorem 12 in Chapter V of Bollobás
(1985). □



Theorem 4. Let $c > 1$. As $n \to \infty$,

$$\frac{D_{cn/2} - u(c)n}{n^{1/2}} \Rightarrow N(0, \sigma^2) \qquad (8)$$

where $\sigma^2 = \rho[1 + \rho(c/2 - 1)]$, and $\rho = 1 - \theta(c)$ is the extinction probability of a supercritical PGW($c$).



Remark. Note that the constant $\sigma^2$ is different from the one given in Berestycki and Durrett
(2003). We were correct in claiming that the central limit theorem in Theorem 4 is the same as the
one for the number of components of the random graph, but we naively thought that the terms in
$\sum_{k=1}^{n} 1/|C_k|$ were sufficiently independent so that $\sigma^2 = \mathrm{var}(T(c))$.

Sketch of Proof. By Pittel's (1990) central limit theorem for the number of components of a
random graph, it suffices to prove that the number of extra components due to fragmentation at time
$cn/2$ is $o(\sqrt{n})$ (see his Corollary 1 and note that $T/c = \rho$). Our first step is to increase the cutoff for
large cycles to $n^{\alpha}$ where $\alpha > 1/2$, so that the number of large cycles is at most $n^{1-\alpha} = o(n^{1/2})$. The
number of fragmentations that produce "small" cycles is now $n^{-(1-\alpha)} \cdot cn/2 = O(n^{\alpha})$ and cannot be
ignored, so we need to use the fact that fragmented cycles are reabsorbed by the large components.
If the fraction of mass in large cycles ("upstairs") at time $tn$ is $\lambda_t$ then new fragments of size $k$ are
produced at rate $\le 2\lambda_t$ and each fragment of size $k$ is reabsorbed at rate $2k\lambda_t$. After a time change
this is bounded by an M/M/∞ queue in which the expected number of customers in equilibrium is
$1/k$. Using this, we can show that with high probability the number of small fragments at any time
is at most $(\log n)^2$. Of course, the coagulation-fragmentation process is not exactly the queuing
system. Customers can split into two, coalesce with other customers, gain weight (and increase
their fragmentation rate) by eating small components, etc. However, $(\log n)^2$ is much smaller than
$n^{1/2}$ so crude but robust estimates and patience eventually lead to a proof. □

3.4 Results for Reversals. 

Theorems 3 and 4 extend easily to the approximate distance for the reversal chain. Recall that the
main difference lies in the fact that a reversal involving edges from different components in the
breakpoint graph always yields a coagulation, but one involving two edges in the same component
may or may not cause a fragmentation. The proofs of Theorems 3 and 4 for transpositions are
based on showing that fragmentations can be ignored, so this difference is unimportant and these
results extend to reversals. As Figure 2 shows, this is not true for the more precise results in
Theorems 1 and 2. For example, the underlying data shows that up to $c = 1$, an average of 23%
of the reversals have caused no change in the distance. Since inversions that affect an edge are
much more frequent than those that involve it, it seems reasonable to guess that in the limit as
$n \to \infty$ the relative orientations of the black edges in a component of the breakpoint graph are
independent. This would imply that the Poisson process of fragmentations in the reversal case is a
1/2-thinning of the one for transpositions, and Theorem 2 would hold with 6 replaced by 12.

3.5 Emergence of a giant cycle? 

Since cycles in the random permutation are smaller than components of the random graph, it
follows that if $c < 1$ then the largest cycle at time $cn/2$ has fewer than $\alpha(c)^{-1} \log n$ vertices, where
$\alpha(c) = c - 1 - \log c$. (See Theorem 10 in Chapter V of Bollobás (1985) or Lemma 3 below.)

For $c > 1$, the largest component of the random graph is, as is well known, "giant," meaning
that it is of order $n$. In fact it is asymptotic to $\theta(c)n$ where $\theta(c)$ is the survival probability of
a supercritical Poisson Galton-Watson process with mean $c$. It is a natural question to ask whether the
largest cycle of the random permutation is also giant in the supercritical regime.

Conjecture. Let $L_1(t)$ be the size of the largest cycle at time $t$. If $c > 1$ then

$$\frac{L_1(cn/2)}{\theta(c)n} \Rightarrow V$$

where $V$ is a random variable with $0 < V < 1$ a.s.

This problem is quite different from our original one. However, our techniques enable us to prove a
partial result in this direction as a corollary of the proof of Theorem 4.

Theorem 5. For any $c > 1$, at time $cn/2$ there are at least $\theta(c)n - o(n)$ vertices located on large
cycles (i.e., of size greater than or equal to $n^{\alpha}$, for any $\alpha < 2/3$).

David Aldous (private communication) conjectures that the relative sizes of the pieces of the giant
cycle are in equilibrium at all times in the supercritical regime, i.e., have the Poisson-Dirichlet
PD(0, 1) distribution, which gives the limiting behavior of the ordered sizes of cycles in a uniform
random permutation. According to this conjecture, $V$ would be distributed as the first coordinate
of a PD(0, 1) random variable. One way to approach this conjecture would be to generalize Aldous
(1997) to show that the large cycles in the critical regime converge to a coagulation-fragmentation
process and to study the growth of clusters in that process.

Alternatively, one could look at the size of the cycle containing 1, $K_1(t)$, and try to show that

$$\frac{K_1(cn/2)}{\theta(c)n} \Rightarrow U$$

where $U$ has a point mass of size $1 - \theta(c)$ at 0 and is otherwise uniform on $(0, \theta(c))$. Figure 4 shows
the average growth of $K_1(cn/2)/n$ in 10,000 simulations of $n = 100$, $n = 1000$, and compares the
results to $EU = \theta(c)^2/2$. Although this considers only one aspect of the distribution of large cycles,
it agrees well with Aldous' conjecture.
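
To evaluate the comparison curve $\theta(c)^2/2$ one needs $\theta(c)$ numerically. The sketch below assumes the standard fixed-point characterization of the survival probability of a Poisson($c$) Galton-Watson process, $1 - \theta = e^{-c\theta}$, which is not stated in the text above; the function name `survival_probability` and the iteration count are illustrative choices.

```python
import math


def survival_probability(c, iterations=200):
    """Largest solution of 1 - theta = exp(-c * theta), by fixed-point iteration."""
    theta = 1.0  # starting from 1 the iterates decrease to the largest root
    for _ in range(iterations):
        theta = 1.0 - math.exp(-c * theta)
    return theta


for c in (1.5, 2.0, 3.0):
    theta = survival_probability(c)
    print(c, theta, theta ** 2 / 2)
```

For $c = 2$ this gives $\theta(c) \approx 0.797$, so the comparison value $\theta(c)^2/2$ is about 0.317.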

Figure 5 shows a histogram of the result of 100,000 simulations of $K_1(100)$ when $n = 100$. As the
graph shows, the spike in the frequency of clusters of size 4 or smaller is what one would predict
from the random graph cluster size distribution. The remainder of the distribution is roughly
uniform except for rounding at the upper end. The latter is to be expected if Aldous' conjecture is
correct, since the size of the giant component satisfies the central limit theorem.

As we were finishing this paper, we learned that Oded Schramm (private communication) has 
proved David Aldous' conjecture. 

Remark. The problem of the emergence of a giant cycle is closely related to Angel's (2003) work
on the existence of infinite orbits for the random stirring process, which is the random transposition
random walk on an infinite graph such as $\mathbb{Z}^d$ or a tree, rather than the complete graph on $\{1, \ldots, n\}$
considered in this work. To explain the connection, suppose that we construct our process using
a Poisson process with rate $2/n^2$ for each unordered pair $i \ne j$, and at these times draw an edge between $i$ and $j$
to indicate that $i$ and $j$ are to be transposed. To compute the cycles in the permutation at time
$cn/2$, we repeat the first $[0, cn/2]$ units of time periodically and then observe the sites that a walker
starting at $i$ visits at times $kcn/2$, for $k = 1, 2, \ldots$. Angel (2003) calls this construction the cyclic
time random walk. Its relevance to his work is that the cyclic time random walk is transient if, and
only if, the cycles are infinite.

4 The subcritical regime 

Let us introduce some notation for the different probability laws involved. For each $n$, we have
the coagulation-fragmentation process, and the Erdős-Rényi random graph model. To emphasize
when computations are being done for the random graph we will use $Q_p$ for the random graph with
Bernoulli percolation parameter $p$, and $Q$ for the law of the evolving random graph that at time $s$
has $p_s = 1 - \exp(-2s/n^2)$. When $s = cn/2$ this probability is $p(c, n) = 1 - \exp(-c/n) \le c/n$. To
simplify notation we will use $QX$ to denote the expected value of $X$ with respect to the probability
$Q$.

4.1 Preliminary results: comparison with a branching process

Our first result provides a useful upper bound. 
Lemma 1. The cluster size $|C_1|$ in $Q_{c/n}$ is dominated by $Z$, the total progeny of a branching process
in which each individual has a Binomial($n - 1$, $c/n$) number of children, i.e., we can construct these
random variables on the same probability space so that $|C_1| \le Z$ a.s. It follows from this that if
$c < 1$ then $Q_{c/n}|C_1| \le 1/(1 - c)$.

Proof. Intuitively, this holds since a vertex in generation $k$ may have children among all of the $n$
vertices of the graph except those of the first $k$ generations. To begin to prove this formally, let
$\zeta_{i,j}$, $1 \le i < j \le n$, be independent random variables, taking the value 1 with probability $c/n$ and 0 with
probability $1 - c/n$. To start the random graph let $Y_0 = \{1\}$ and let $Y_1 = \{j \notin Y_0 : \zeta_{1,j} = 1\}$. To
start the branching process let $Z_0 = 1$, $Z_1 = |Y_1|$, and let $\phi_1 : Y_1 \to \{1, 2, \ldots, Z_1\}$ be 1-1 and onto.

If the first $k$ stages of the construction have been done and we have $Y_k \ne \emptyset$ and a $\phi_k : Y_k \to
\{1, \ldots, Z_k\}$ that is 1-1 (but not onto in general), then let

$$Y_{k+1} = \bigcup_{i \in Y_k} \Big\{ j \notin \bigcup_{l=0}^{k} Y_l : \zeta_{i,j} = 1 \Big\}$$

We let individual $\phi_k(i)$ in the $k$th generation of the branching process have $|\{j \ne i : \zeta_{i,j} = 1\}|$
children. The individuals in the branching process that are not in $\phi_k(Y_k)$ have a number of children
given by independent binomials. It should be clear from the construction that we can again define
$\phi_{k+1} : Y_{k+1} \to \{1, \ldots, Z_{k+1}\}$ to be 1-1, and the comparison follows by induction. The inequality
follows by computing $EZ$ (for instance by summing a geometric series). □

The next result shows that the bound in Lemma 1 is exact in the limit. Let $\{Z_k\}_{k \ge 0}$ be a Poisson Galton-Watson process with offspring mean $c$ and let $Z = \sum_{k=0}^{\infty} Z_k$ be its total progeny.

Lemma 2. Let $C_1$ be the cluster that contains vertex 1. If $0 < c < 1$ then as $n \to \infty$

$$Q_{p(c,n)}(|C_1| = k) \to P(Z = k)$$

Proof. The number of children of vertex 1, $Z^n_1 = |Y_1|$, has distribution Binomial$(n-1, p(c,n))$, which converges to a Poisson$(c)$ limit. Let $k \ge 1$ and let $(n_1, \ldots, n_{k+1}) \in \mathbb{N}^{k+1}$. If we let $Z^n_j = |Y_j|$
then

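$$Q_{p(c,n)}\bigl(Z^n_{k+1} = n_{k+1} \mid Z^n_1 = n_1, \ldots, Z^n_k = n_k\bigr) = P\Bigl(\sum_{i=1}^{n_k} \xi^n_i = n_{k+1}\Bigr)$$

where $\xi^n_i$ are i.i.d. Binomial$(n-s, p(c,n))$ random variables, and $s = \sum_{j=0}^{k} n_j$ with $n_0 = 1$. From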
this it follows easily that the finite-dimensional distributions of $\{Z^n_j\}_{j \ge 1}$ converge to those of PGW$(c)$. Markov's inequality and the domination result in Lemma 1 imply that

$$Q_{p(c,n)}\Bigl(\sum_{k=K}^{\infty} Z^n_k \ge 1\Bigr) \le E\Bigl(\sum_{k=K}^{\infty} Z_k\Bigr) \le c^K/(1-c)$$

and the desired conclusion follows. □ 

Our next ingredient is

Lemma 3. $Q_{c/n}(|C_1| \ge y) \le c^{-1}\exp(-(c - 1 - \ln c)y)$.

Proof. In view of Lemma 1 it suffices to prove the result for $Z$, rather than $|C_1|$. To do this, let

$$\phi_n(\theta) = \sum_{m=0}^{n-1} \binom{n-1}{m} \Bigl(\frac{c}{n}\Bigr)^m \Bigl(1 - \frac{c}{n}\Bigr)^{n-1-m} e^{\theta(m-1)} = e^{-\theta}\Bigl(1 - \frac{c}{n} + \frac{c}{n}e^{\theta}\Bigr)^{n-1}$$

be the moment generating function of the distribution of the number of offspring minus 1. Let $S_m$ be a random walk that takes steps with this distribution and $S_0 = 1$, so that $S_m$ explores the Galton-Watson tree. Then $\tau = \inf\{m : S_m = 0\}$ has the same distribution as $Z$. Let $R_m = \exp(\theta S_m)/\phi_n(\theta)^m$; $R_m$ is a nonnegative martingale. Stopping at time $\tau$ we have $e^{\theta} \ge E(\phi_n(\theta)^{-\tau})$. If $\phi_n(\theta) < 1$ it follows that

$$P(\tau \ge y)\,\phi_n(\theta)^{-y} \le E(\phi_n(\theta)^{-\tau}) \le e^{\theta}$$

Using $\phi_n(\theta) \le e^{-\theta}\exp(c(e^{\theta} - 1))$ now we have

$$P(\tau \ge y) \le e^{\theta}\bigl(e^{-\theta}\exp(c(e^{\theta} - 1))\bigr)^{y}$$

To optimize the bound we want to minimize $c(e^{\theta} - 1) - \theta$. Differentiating this means that we want $ce^{\theta} - 1 = 0$ or $\theta = -\log(c)$. Plugging this in and recalling that $\tau$ and $Z$ have the same distribution we have

$$P(Z \ge y) \le \frac{1}{c}\exp(-(c - 1 - \ln c)y)$$

It follows that

$$Q_{c/n}(|C_1| \ge y) \le \frac{1}{c}\exp(-(c - 1 - \ln c)y)$$

which completes the proof of Lemma 3. □
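As a sanity check on the exponential bound, one can look at the limiting Poisson case, where the total progeny $Z$ of a Poisson$(c)$ Galton-Watson process has the Borel distribution $P(Z = k) = e^{-ck}(ck)^{k-1}/k!$ (a standard fact, used here as an assumption rather than something taken from the text above). The sketch below compares its exact tail with $c^{-1}\exp(-(c-1-\ln c)y)$; the truncation level `kmax` and the function names are illustrative choices.

```python
import math

def borel_pmf(k, c):
    """P(Z = k) for the total progeny of a Poisson(c) Galton-Watson process, c < 1."""
    # exp(-c k) (c k)^(k-1) / k!, evaluated in logarithms for numerical stability
    return math.exp(-c * k + (k - 1) * math.log(c * k) - math.lgamma(k + 1))

def tail(y, c, kmax=5000):
    """P(Z >= y), truncating the sum at kmax (terms decay geometrically for c < 1)."""
    return sum(borel_pmf(k, c) for k in range(y, kmax))

def chernoff_bound(y, c):
    """The bound c^{-1} exp(-(c - 1 - ln c) y) appearing in Lemma 3."""
    alpha = c - 1 - math.log(c)
    return math.exp(-alpha * y) / c

if __name__ == "__main__":
    for y in (1, 5, 10, 20):
        print(y, tail(y, 0.5), chernoff_bound(y, 0.5))
```

The bound is far from tight for small $y$ but captures the exponential decay rate $c - 1 - \ln c$.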

Now recall that for $c < 1$, $Z_c = \sum_{s \le cn/2} 1_{\{F_s\}}$ is the number of fragmentations up to time $cn/2$.

Lemma 4. Let $f_k(s)$ be the empirical fraction of vertices in cycles of size $k$ at time $s$. If $0 < c < 1$ then $E f_k(cn/2) \to P(Z = k)$ and $EZ_c \to \kappa(c)$, where $\kappa(c)$ was defined in Theorem 1.

Proof. The cycle sizes at time $s$ in the coagulation-fragmentation process are dominated by the cluster sizes in the random graph model with $p_s = 1 - \exp(-2s/n^2) \le 2s/n^2$. Therefore,

$$EZ_c \le \int_0^{cn/2} \sum_k \frac{k-1}{n}\, Q f_k(s)\, ds \le \int_0^{cn/2} \frac{Q_{2s/n^2}|C_1| - 1}{n}\, ds$$

Using Lemma 1, $Q_{2s/n^2}|C_1| \le 1/(1 - (2s/n))$. Changing variables $un/2 = s$ we have

$$EZ_c \le \frac{1}{2}\int_0^c \frac{u}{1-u}\, du = -\frac{1}{2}(\log(1 - c) + c) = \kappa(c) \tag{9}$$

Since unfragmented cycles are the same as tree components in the random graph, the first convergence result follows from Lemma 2. The second one follows from Fatou's lemma and (9). □
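The quantity $EZ_c$ can also be estimated by direct simulation of the walk. The sketch below is only illustrative and makes two simplifying assumptions that are not in the text: the continuous-time walk is replaced by a fixed number $\lfloor cn/2 \rfloor$ of discrete steps, and each step picks $i$ and $j$ independently and uniformly (so $i = j$ is a step where nothing happens). All names are hypothetical.

```python
import math
import random

def count_fragmentations(n, steps, rng):
    """Run `steps` random transpositions from the identity and count cycle splits."""
    perm = list(range(n))
    frags = 0
    for _ in range(steps):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue  # trivial transposition, the permutation is unchanged
        # Walk along the cycle of i to see whether j lies on it.
        x, same_cycle = perm[i], False
        while x != i:
            if x == j:
                same_cycle = True
                break
            x = perm[x]
        if same_cycle:
            frags += 1
        # Multiplying by (i j) splits the cycle if i and j share it, merges otherwise.
        perm[i], perm[j] = perm[j], perm[i]
    return frags

def kappa(c):
    """kappa(c) = -(log(1 - c) + c) / 2, the limit of E Z_c for c < 1."""
    return -(math.log(1 - c) + c) / 2

if __name__ == "__main__":
    n, c, trials = 200, 0.5, 4000
    rng = random.Random(0)
    mean = sum(count_fragmentations(n, int(c * n / 2), rng) for _ in range(trials)) / trials
    print("empirical mean:", mean, " kappa(c):", kappa(c))
```

With these parameters the empirical mean should be close to $\kappa(0.5) \approx 0.097$, up to sampling noise and finite-$n$ effects.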

The final preparatory step is:

Lemma 5. If $c < 1$ the expected number of fragmentations that occur to cycles that have already been fragmented is $\le K_c(\log n)^2/n$, where $K_c = 9c\kappa(c)\alpha(c)^{-2}$. (Recall $\alpha(c) = c - 1 - \log c$.)

Proof. The expected number of such fragmentations is at most:

$$E\int_0^{cn/2} \frac{\#\{\text{vertices in fragments at time } t\}}{n} \cdot \frac{L_1(t)}{n}\, dt$$

where $L_1(t)$ is the size of the largest component at time $t$. In the event that $L_1(cn/2) \le 3\alpha(c)^{-1}\log n$, changing variables $t = bn/2$ shows that the above is at most

$$(n/2)\,(3\alpha(c)^{-1}\log n/n)^2 \int_0^c \kappa(b)\, db \le \frac{1}{2}K_c\frac{(\log n)^2}{n}$$

On the other hand by Lemma 3 the complement of this event has probability at most $n^{-2}$, and there can never be more than $cn/2$ such fragmentations, so Lemma 5 is proved. □



4.2 Proof of Theorem 1 

We are now ready to prove Theorem 1. Let $Z^n_c = \sum_{s \le cn/2} 1_{\{F^n_s\}}$, $0 < c < 1$, be the counting process of fragmentations that occur to cycles which (a) have not been fragmented previously and (b) have size $\le n^{0.7}$. The second condition is irrelevant in this section, but imposing it now will help in the next one. Unfragmented cycles correspond to trees in the random graph so the compensator of $Z^n_c$ is

$$\kappa^n(c) = \int_0^{cn/2} \psi^n_s\, ds \tag{10}$$

where $\psi^n_s = \sum_{k=1}^{n^{0.7}} f_k(s)(k-1)/n$ and $f_k(s)$ is the fraction of vertices that belong to tree components of size $k$. As noted in the sketch of the proof, it is enough to show that for each fixed $c$ the compensator of the full fragmentation count converges in probability to $\kappa(c)$, or, by Lemma 5, that $\kappa^n(c)$ converges to $\kappa(c)$ in probability. Lemmas 4 and 5 imply that $E[\int_0^{cn/2} \psi^n_s\, ds] \to \kappa(c)$. It remains to show that $\operatorname{var}(\int_0^{cn/2} \psi^n_s\, ds) \to 0$.
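Our first step will be to prove:

$$\operatorname{var}(\psi^n_s) \le \frac{K}{n^3}\, Q_{p(c,n)}[|C_1|^3] \tag{11}$$

for all times $s \le cn/2$, where $K$ is a constant that depends only on $c$.
To see this, first observe that in terms of cluster sizes

$$\psi^n_s = \frac{1}{n^2}\sum_{i=1}^{n} (|C_i| - 1) I_i$$

where $I_i$ is the indicator of the event that $C_i$ is a tree. Let $d_i = (|C_i| - 1)I_i$. Then

$$\operatorname{var}(\psi^n_s) = \frac{1}{n^4}\operatorname{var}(d_1 + \cdots + d_n) = \frac{1}{n^4}\bigl(n\operatorname{var}(d_1) + n(n-1)\operatorname{cov}(d_1, d_2)\bigr) \tag{12}$$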
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Monotonicity and Lemma El imply, 

var((ii) <Qp(,,„)[|Ci|2] 
It remains to bound cov {di, ^2). If we let 

where ps = 1 — exp(— 2s/n^) then by the reasoning for Q) we have 



Qp^[Ci n C2 = 0, \Ci\ = j, IC2I = k, Ci and C2 are trees] 



n - 2 
J-1 



vr. 



n-j-1 
k-1 



QpJCi = C2, |Ci| = /c, Ci is a tree] 



n - 2 
k-2 



From this it follows that cov (di, d2) 

E/n — 2\ / n — j — 1 



k-l 



n — l\ fn — \ 



{j -l){k- l)7r„,_ 



n - 2 
/c - 2 



(fc - 1)V, 



For the first term in the right-hand side, 

— 2\ (n — j — 1 



< 



k-l 
{n - 2)!e' 



n — l\ / n — 1 
j-l)[k-l 
(n- 1)! 



(n-1)! 



(j - ly.ik - l)!(n -j-k)l (j - l)!(n - j)l (k - l)!(n - A;)! 



< 



since (n — 2y.e^ < (n — 1)! for large n and (n — j)\/{n — j — k:)\ < (n — l)!/(n — 1 — h)\. 
For the second term, 

A: ^ ^ fc ^ ^ 

Combining this with (12) and (13) gives (11).
Hence by the Cauchy-Schwarz inequality we get:



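$$\operatorname{var}\Bigl(\int_0^{cn/2} \psi^n_s\, ds\Bigr) = Q\Bigl[\Bigl(\int_0^{cn/2} (\psi^n_s - Q\psi^n_s)\, ds\Bigr)^2\Bigr] \le \frac{cn}{2}\int_0^{cn/2} \operatorname{var}(\psi^n_s)\, ds \le \frac{cn}{2}\int_0^{cn/2} \frac{K}{n^3}\, ds = \frac{Kc^2}{4n} \to 0$$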



where we have used both (11) and Lemma 3. This completes the proof of Theorem 1.

5 The critical regime



The first step in the proof of Theorem 2 is to argue that fragmentations of previously fragmented cycles can be ignored. The number of such fragmentations is smaller than the total number of cycles in multicyclic components (i.e., components with at least 2 cycles) in the random graph. Theorem 1 and Corollary 3 in Łuczak, Pittel, and Wierman (1994) imply that the total number of cycles in multicyclic components in the critical regime is bounded in probability.^1 In particular, divided by $(\log n)^{1/2}$ it converges to 0 in probability. As a result, by the converging together lemma (see e.g., Durrett (1996), Chap. 2, Ex. 2.10), it suffices to prove the central limit theorem for the number of fragmentations on tree components.

As in the previous section, we will in addition restrict our attention to fragmentations of tree components of size at most $n^{0.7}$, and continue to use the notation introduced there. (Indeed, classical results from the theory of random graphs, or Aldous (1997), show that asymptotically almost surely all clusters are smaller than $n^{0.7}$.)

Let $W_n(r) := (6/\log n)^{1/2}(Z^n(r) - \kappa^n(r))$. By standard methodology in the theory of stochastic processes (see Jacod and Shiryaev (1987) or Revuz and Yor (1999) for instance), to prove convergence of $W_n(\cdot)$ to Brownian motion, the two things we need to check are: (i) $E[\sup_{0 \le r \le 1} |W_n(r) - W_n(r-)|] \to 0$ and (ii) the quadratic variation of $W_n$, i.e. the increasing process associated with $W_n(\cdot)^2$, must converge to $r$ at time $r$. (i) is obvious because $Z^n$ is a counting process, and (ii) turns into $E(6Z^n(r)/\log n) \to r$ and $\operatorname{var}(6Z^n(r)/\log n) \to 0$. These two steps are dealt with respectively in Lemmas 7 and 8.

But first, we need a technical lemma that will be useful on several occasions (e.g., for computing 
precise asymptotics of the number of trees of a given size) . 

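Lemma 6. If $k \to \infty$ and $k = o(n^{3/4})$ then

$$\gamma_{n,k}(c) \sim \lambda_{n,k}(c) \equiv \frac{nk^{-5/2}}{c\sqrt{2\pi}}\exp\Bigl(-\alpha(c)k + (c-1)\frac{k^2}{2n} - \frac{k^3}{6n^2}\Bigr)$$

where $\alpha(c) = c - 1 - \log(c)$. There is a constant $K$ so that if $1 \le k \le n^{0.7}$ and $c \le 1$ then

$$\gamma_{n,k}(c) \le K\lambda_{n,k}(c).$$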

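Proof. Stirling's formula implies $k! \sim k^k e^{-k}\sqrt{2\pi k}$. Using this we have that

$$\gamma_{n,k}(c) \sim \frac{nk^{-5/2}}{c\sqrt{2\pi}}\, e^k c^k \prod_{j=1}^{k-1}\Bigl(1 - \frac{j}{n}\Bigr)\Bigl(1 - \frac{c}{n}\Bigr)^{kn - k^2/2 - 3k/2 + 1}$$

Using the expansion $\log(1 - x) = -x - x^2/2 - x^3/3 - \ldots$ we see that if $k = o(n)$ then

$$\Bigl(1 - \frac{c}{n}\Bigr)^{kn - k^2/2 - 3k/2 + 1} \sim \exp(-ck + ck^2/2n)$$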



while if $k = o(n^{3/4})$ we have

$$\prod_{j=1}^{k-1}\Bigl(1 - \frac{j}{n}\Bigr) = \exp\Bigl(\sum_{j=1}^{k-1}\log\Bigl(1 - \frac{j}{n}\Bigr)\Bigr) \sim \exp\Bigl(-\frac{k(k-1)}{2n} - \frac{k(k-1)(2k-1)}{12n^2}\Bigr) \sim \exp\Bigl(-\frac{k^2}{2n} - \frac{k^3}{6n^2}\Bigr)$$

Combining the last three formulas gives the asymptotic formula. To prove the bound we note that Stirling's formula implies $k! \ge \delta k^k e^{-k}\sqrt{2\pi k}$ for some $\delta > 0$. Using the bounds $\log(1 - x) \le -x$ and $\log(1 - x) \le -x - x^2/2$ in the last two calculations gives the upper bound. □

^1 This result could also be derived from the Folk Theorem 1 in Aldous (1997), which gives the limit for the joint distribution of the component sizes and the number of cycles they contain. See the discussion on page 850 of his paper.



Lemma 7.

$$E\Bigl[\frac{6}{\log n}\int_0^{c_n(r)n/2} \psi^n_s\, ds\Bigr] \to r$$



Proof. The upper bound follows from (9), which holds for all $c < 1$. In the other direction, changing variables $s = c_n(v)n/2$ where $c_n(v) = 1 - n^{-v/3}$ and noting $c_n'(v) = (1/3)(\log n)n^{-v/3}$, gives

$$E\Bigl[\frac{6}{\log n}\int_0^{c_n(r)n/2} \psi^n_s\, ds\Bigr] = \int_0^r \sum_{k=1}^{n^{0.7}} \frac{k-1}{n}\, Q_{p(c_n(v),n)}[kT_k]\, n^{-v/3}\, dv \tag{15}$$

where $T_k$ is the number of tree components of size $k$, and $p(c,n) = 1 - \exp(-c/n)$.

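We can take the limit of the last expression by using the formula for $ET_k$ combined with Lemma 6. Indeed that formula shows that $ET_k = \gamma_{n,k}(c)$, and $k \le n^{0.7} = o(n^{3/4})$, so that the use of Lemma 6 is justified. Hence

$$E\Bigl[\frac{6}{\log n}Z^n(r)\Bigr] = \int_0^r \sum_{k=1}^{n^{0.7}} \frac{k(k-1)}{n}\, \gamma_{n,k}(c_n(v))\, n^{-v/3}\, dv$$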



Setting $c = 1 - b$ with $b = n^{-v/3} \to 0$ and using Taylor's theorem

$$-(c - 1 - \log(c))k - b\frac{k^2}{2n} = -\frac{b^2}{2}k - b\frac{k^2}{2n} + o(b^2k)$$

The first term becomes significantly negative when $k \sim 1/b^2 = n^{2v/3}$, the second when $k \sim \sqrt{n/b} = n^{1/2 + v/6}$. When $v < r < 1$ the first threshold is smaller and the second term can be ignored. Thus Lemma 6 and the last observation imply that if $v < 1$

$$\sum_{k=1}^{n^{0.7}} \frac{k(k-1)}{n}\, Q_{p(c_n(v),n)}[T_k] \sim \frac{1}{\sqrt{2\pi}}\sum_{k=1}^{n^{0.7}} k^{-1/2}\exp(-n^{-2v/3}k/2) \tag{16}$$



Here we have used the asymptotic formula of Lemma 6 for all $k$. However, the next computation will show that the sum grows like $n^{v/3}$ so the contributions from small $k$ can be ignored.

If we view the sum in (16) as a Riemann sum with spacing $n^{-2v/3}$ we can rewrite it as

$$\frac{n^{v/3}}{\sqrt{2\pi}}\sum_{k=1}^{n^{0.7}} n^{-2v/3}(n^{-2v/3}k)^{-1/2}\exp(-n^{-2v/3}k/2)$$

From this it follows that

$$n^{-v/3}\sum_{k=1}^{n^{0.7}} \frac{k(k-1)}{n}\, Q_{p(c_n(v),n)}[T_k] \to \int_0^{\infty} (2\pi)^{-1/2}x^{-1/2}e^{-x/2}\, dx$$

Changing variables $x = y^2$, $dx = 2y\, dy$ the integral becomes $(2\pi)^{-1/2}\int_0^{\infty} 2e^{-y^2/2}\, dy = 1$. Therefore, by Fatou's lemma:

$$\liminf_{n \to \infty} E\Bigl[\frac{6}{\log n}\int_0^{c_n(r)n/2} \psi^n_s\, ds\Bigr] \ge r$$

□
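The normalisation used in the proof of Lemma 7, namely that $(2\pi)^{-1/2}\int_0^\infty x^{-1/2}e^{-x/2}\,dx = 1$, and the Riemann-sum approximation of this integral can be checked numerically. The following sketch is only an illustration: the spacing `h` plays the role of $n^{-2v/3}$, and the cutoff `x_max` is an arbitrary truncation chosen here.

```python
import math

def riemann_sum(h, x_max=60.0):
    """(2*pi)^(-1/2) * sum_{k>=1} h * (h*k)^(-1/2) * exp(-h*k/2), truncated at h*k <= x_max."""
    kmax = int(x_max / h)
    s = sum((h * k) ** -0.5 * math.exp(-h * k / 2) for k in range(1, kmax + 1))
    return h * s / math.sqrt(2 * math.pi)

# Closed form: the integral of x^(-1/2) e^(-x/2) over (0, inf) is Gamma(1/2) * 2^(1/2).
exact = math.gamma(0.5) * math.sqrt(2) / math.sqrt(2 * math.pi)

if __name__ == "__main__":
    print("closed form:", exact)
    for h in (1e-1, 1e-2, 1e-3):
        print("spacing", h, "Riemann sum", riemann_sum(h))
```

Because the integrand is singular at 0, the Riemann sum converges slowly (the error is of order $\sqrt{h}$), which is consistent with the remark that small $k$ contribute only a lower-order term.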



We turn now to the analysis of the variance. 
Lemma 8. $\operatorname{var}\bigl(\frac{6}{\log n}\kappa^n(r)\bigr) \to 0$.

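Proof. Changing variables as in (15) and using the Cauchy-Schwarz inequality as in the proof of Theorem 1,

$$\operatorname{var}\Bigl(\frac{6}{\log n}\int_0^{c_n(r)n/2} \psi^n_s\, ds\Bigr) \le n^2\int_0^r n^{-2v/3}\operatorname{var}\bigl(\psi^n_{c_n(v)n/2}\bigr)\, dv$$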



Reasoning as in but using the bound in Lemma El 



fc=l A;=l 



To check the right-hand side note that the power of $k$ has increased by 2 from the previous calculation. If we view the last sum as a Riemann sum with spacing $n^{-2v/3}$, we can rewrite it as



n 

,5 



n 

k=l 



Now $x^{3/2}e^{-x/2}$ has derivative $((3/2)x^{1/2} - (1/2)x^{3/2})e^{-x/2}$ so it is increasing on $[0,3]$ and then decreasing on $[3,\infty)$. Thus if we discard the term with the largest $k$ so that $n^{-2v/3}k \le 3$ we have a lower bound on the integral.
Using this it follows that

$$\operatorname{var}\Bigl(\frac{6}{\log n}\kappa^n(r)\Bigr) \le \frac{K}{n}\int_0^r n^{u}\, du$$

Writing $n^{u} = \exp(u\log n)$ and integrating we have that the right-hand side is $\le K/(\log n) \to 0$.
This concludes the proof of the first result in Theorem 2. □

The final step is to estimate the number of fragmentations that occur to tree components of size $\le n^{0.7}$ at times between $(1 - n^{-1/3})n/2$ and $n/2$:

$$\int_{(1 - n^{-1/3})n/2}^{n/2} Q\psi^n_s\, ds$$

For each s in the interval the integrand is smaller than J2k=i ^Qi/n^k- Using Lemma IHJ the last 
quantity is smaller than 

^f;fc-V2exp(-/fcV3n2) 



n 

k=l 



which we can rewrite as 



^ ^-2/3(^^-2/3)-i/2 exp(-(A:n-2/3)3/2) 
" fc=i 

The above sum is a Riemann sum so it converges to j;-V2e-^V2(ia;. Therefore, Q^^ < Kn~'^l'^. 
Since the duration of the critical regime is $n^{2/3}/2$, the expected number of fragmentations is bounded and the proof of Theorem 2 is complete.



6 The supercritical regime 

By Pittel's (1990) central limit theorem for the number of components of a supercritical random graph, it is enough to show that, with probability going to 1 as $n \to \infty$, at time $cn/2$ there are $o(n^{1/2})$ extra components due to fragmentation. (This was already indicated in the sketch of the proof of Theorem 3.)

Let $\alpha = 0.55$. (In fact the results stated in this section would also be valid for any $1/2 < \alpha < 2/3$, but making this choice makes some proofs slightly easier.) We call cycles of size $k \ge n^{\alpha}$ large. These can be ignored since there cannot be more than $n^{1-\alpha} = o(n^{1/2})$ such components. We define the amount of mass "upstairs" by

$$N^{\uparrow}_t = \sum_{k \ge n^{\alpha}} kX_k(t)$$

where $X_k(t)$ is the number of cycles of size $k$ at time $n/2 + t$. (It is convenient in this section to shift the time so that $t = 0$ corresponds to the critical time $n/2$.) If all of the mass was upstairs, then the expected number of cycles of size less than $n^{\alpha}$ produced by fragmentation would be $2n^{\alpha-1}(cn/2) = O(n^{\alpha})$. It is overly pessimistic to think that all of the mass will be upstairs, but by analogy with the random graph, we expect (and will eventually prove in Theorem 13) that at times $c > 1$ a positive fraction of the total mass $n$ will be there, so this estimate of the number of fragmentations is too large to ignore.

To improve this crude estimate, we take advantage of the fact that fragmented pieces are reabsorbed upstairs. Let $X^{\uparrow}_k(t)$ be the number of cycles of size $k$ produced by fragmentation of cycles upstairs. $X^{\uparrow}_k(t)$ can only increase when a transposition is performed, and only if it involves one of the $N^{\uparrow}_t$ vertices upstairs and one of the 2 points located $k$ steps away when writing the corresponding cycle of the current permutation. This gives a rate at most $2N^{\uparrow}_t/n^2$. As for the death rate, one way to get rid of a component of size $k$ is by picking one of the $k$ vertices of one of the $X^{\uparrow}_k(t)$ components and one of the $N^{\uparrow}_t$ vertices upstairs. This happens with rate $2kX^{\uparrow}_k(t)N^{\uparrow}_t/n^2$. For the moment we are ignoring the fact that cycles may experience coalescence or fragmentation while downstairs. We will deal with these complexities once we have an understanding of the basic birth and death process of fragments of large clusters.

6.1 The cluster queuing system 

It is fortunate that the unknown quantity $N^{\uparrow}_t \le n$ appears in both rates, so that as long as $N^{\uparrow}_t > 0$ we can remove it by a time change. Once this is done, we have a system of stochastic processes $Y^k_t$, for $1 \le k < n^{\alpha}$, that we call a cluster queuing system: let $Y^k_t$ be independent birth-and-death chains with birth rate 1 and death rate $kY^k_t$, that begin with $Y^k_0 = 0$.

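Lemma 9. With probability tending to 1 as $n \to \infty$ we have

$$\sum_{k=1}^{n^{\alpha}} Y^k_t \le (\log n)^2 \quad\text{and}\quad \sum_{k=1}^{n^{\alpha}} kY^k_t \le n^{\alpha}(\log n)^2$$

for all $t \le c$ ($c > 0$).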

Remark. Although this system of stochastic processes can be defined without any reference to our random walk problem, it is useful to bear in mind that the state of this cluster queuing system at time $t$ describes the number of fragments of large cycles at a time of the random walk that is at least $(1+t)n/2$, since $N^{\uparrow}_t \le n$. Thus the control obtained in the above lemma for all $t \le c$ will provide useful information for the random walk between times $n/2$ and $(1 + c)n/2$ for any $c > 0$. On our original time-scale, this corresponds exactly to the supercritical regime, i.e. up to time $cn/2$ for any $c > 1$.

Proof. The second result is a trivial consequence of the first. The key idea to handle the processes $Y^k_t$ is to consider strips $2^j \le k < 2^{j+1}$. Because there are no simultaneous jumps, we can prove that the queues at each level $k$ are independent processes (see e.g. Revuz-Yor (1999), chap. XII, prop. (1.7), for a proof of this fact in the case of Poisson processes). Therefore, for each $1 \le j \le \log_2 n^{\alpha}$, the number of cycles with sizes in $[2^j, 2^{j+1})$ is dominated by a birth and death chain with birth rate $2^j$ and death rate $2^j m$ when there are $m$ individuals. To analyze these processes, we consider the successive excursions away from 0. Their embedded discrete time processes $Y_s$ jump from $m$ to $m - 1$ with probability $m/(m+1)$ and from $m$ to $m + 1$ with probability $1/(m+1)$. Let us try to find a function $\phi$ such that $\phi(0) = 0$, $\phi(1) = 1$ and $\phi(Y_s)$ is a martingale. The latter implies

$$\frac{1}{m+1}[\phi(m+1) - \phi(m)] = \frac{m}{m+1}[\phi(m) - \phi(m-1)]$$

so $\phi(x) = \sum_{k=1}^{x} (k-1)!$. Since an excursion starts at $Y_0 = 1$, $\phi(1) = 1$ and $\phi(0) = 0$, it follows by optional sampling that the maximum level reached during an excursion of $Y_s$, $M$, satisfies

$$P(M > x) = 1/\phi(x+1) \le 1/x! \tag{17}$$

To bound the number of excursions for the process in the $j$th strip before time $c$, $N_j(c)$, we note that jumps from 0 to 1 occur at rate $2^j$ so ignoring the amount of time it takes to return to 0 from 1, the number of excursions by time $c$ is bounded by a Poisson random variable with mean $2^j c \le cn^{\alpha}$. Markov's inequality implies that $P(N_j(c) \ge n^2) \le cn^{\alpha-2}$ so

$$P\Bigl(\max_{1 \le j \le \alpha\log_2 n} N_j(c) \ge n^2\Bigr) \to 0 \tag{18}$$



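To estimate the probability that the maximum of $n^2$ excursions is $\ge \log n$ we recall (17) and that Stirling's formula implies $k! \ge \delta_0 k^k e^{-k}/\sqrt{2\pi k}$ for some $\delta_0 > 0$, so

$$(\log n)! \ge \delta_1(\log n)^{\log n}\, n^{-1}(\log n)^{-1/2} = \delta_1 n^{\log\log n - 1}(\log n)^{-1/2}$$

The right-hand side goes to $\infty$ faster than $n^2\log_2 n$ so using (18) we have

$$P\Bigl(\max_{1 \le j \le \alpha\log_2 n}\ \max_{0 \le t \le c} Y^{(j)}_t \ge \log n\Bigr) \to 0$$

where $Y^{(j)}_t$ denotes the number of customers in the $j$th strip at time $t$. When the last event does not occur we have

$$\sum_{k} Y^k_t \le \alpha(\log_2 n)\log n = \frac{\alpha}{\log 2}(\log n)^2$$

Since $\alpha < 2/3 < \log 2 \approx 0.69$, this gives the desired result. □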
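The martingale computation behind (17) is easy to verify with exact rational arithmetic. The sketch below assumes the form $\phi(x) = \sum_{k=1}^{x}(k-1)!$, which is what the recursion above forces once $\phi(0) = 0$ and $\phi(1) = 1$; the function names are illustrative.

```python
from fractions import Fraction
from math import factorial

def phi(x):
    """phi(x) = sum_{k=1}^{x} (k-1)!, so that phi(0) = 0 and phi(1) = 1."""
    return sum(factorial(k - 1) for k in range(1, x + 1))

def is_harmonic(m):
    """Check E[phi(Y_{s+1}) | Y_s = m] = phi(m) for the embedded chain, m >= 1."""
    up, down = Fraction(1, m + 1), Fraction(m, m + 1)
    return up * phi(m + 1) + down * phi(m - 1) == phi(m)

def prob_exceeds(x):
    """P(M > x): probability of reaching x + 1 before 0 from 1, i.e. phi(1) / phi(x + 1)."""
    return Fraction(1, phi(x + 1))

if __name__ == "__main__":
    for x in range(1, 8):
        print(x, prob_exceeds(x), Fraction(1, factorial(x)))
```

The printed pairs show how quickly the excursion maximum decays: the exact probability is always below $1/x!$, which is what makes $\log n$ a safe ceiling for $n^2$ excursions.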



6.2 Completion of the proof of Theorem 3

The cluster queuing system is the first approximation to the analysis of the dynamics of the supercritical regime. However, it ignores customer fragmentation and a number of "bad events" that we need to consider in order to give a rigorous proof of Theorem 3. Though a priori one might expect it to be difficult to take account of corrections of second order, third order, . . ., and have nightmares about adding up infinitely many terms, we were pleasantly surprised to see that the proof could be completed with a few simple estimates.

The first technical problem to confront is to show that the total amount of mass upstairs stays positive at any given time so we can apply our time change. This is done in Section 6.3.

The more difficult problem is to control the difference between the CQS and the real system of clusters. To do this, we need a notational scheme to verify that we have indeed taken care of all of the relevant events. We call clusters of size larger than $n^{\alpha}$ large, those in the CQS (i.e., those that were generated by a fragmentation of some large cycle) medium, and non-giant clusters in the random graph small. Writing frag and coag as shorthand for fragmentation and coagulation, we have three frag and six coag events to handle:

coag(small, small) is a natural part of the random graph so these events are not errors. The fragmentation of small clusters involves $o(n^{1/2})$ clusters and hence does not significantly alter this process (see frag(small) and Lemma 12).

coag(small, large) eliminates a small component, but in the random graph this corresponds to the small cluster being absorbed into the giant component, so this is not an error.

frag(small) is easy to take care of due to the duality principle which asserts that finite clusters in the random graph at time $c > 1$ have the same distribution as clusters at time $c\rho < 1$ where $\rho$ is the probability of no percolation. This allows us to use our subcritical estimates for fragmentation of small supercritical clusters. More details are given in Lemma 12.

coag(large, large). We do not care about these events since we do not need to keep track of the number of cycles upstairs.

frag(large). These are the arrivals in the cluster queuing system.

coag(medium, large) are (almost) the departures in the cluster queuing system. The problem is that the next three events can cause clusters to gain weight or split into two.

coag(medium, medium) are helpful events since they reduce the number of customers in the CQS. This does make the fragmentation rate for the new cluster larger than the sum of the two previous clusters but Lemma 11 will take care of this. More importantly, it makes the departure rate of the new cluster larger. This, applied to coag(medium, medium) and coag(medium, small), shows that the number of medium clusters is stochastically bounded by the CQS of Section 6.1 and is the content of Lemma 10.

coag(medium, small) eliminates a small component, but in the random graph this corresponds to the small cluster being absorbed into the giant component. Again, this also makes the fragmentation rate larger for the cluster that gained weight but Lemma 11 will take care of this.

frag(medium) is taken care of by Lemma 11.

To complete the proof it remains to prove the three promised lemmas. 

Lemma 10. The number of medium clusters is dominated by that of the CQS. Therefore there are never more than $(\log n)^2$ medium clusters, and never more than $n^{\alpha}(\log n)^2$ vertices in medium clusters.

Proof. As was just mentioned, the only differences between the CQS and the medium clusters are generated by events of type coag(medium, medium) and coag(small, medium). However, neither of these events increases the number of medium clusters, and both make the death rate of the clusters concerned higher. Hence we can construct the CQS and the medium clusters process on the same probability space, in such a way that the total number of medium clusters is smaller than that of the CQS. □

Lemma 11. The expected number of fragmentations of medium clusters is at most $O(n^{2\alpha-1}(\log n)^2)$.

Proof. There are never more than $(\log n)^2$ medium clusters. Since there are at most $n^{\alpha}$ vertices per medium cluster, the total number of vertices in medium clusters is at most $n^{\alpha}(\log n)^2$. The rate at which those fragmentations happen is thus bounded by

$$\Bigl(\frac{n^{\alpha}(\log n)^2}{n}\Bigr)\frac{n^{\alpha}}{n}$$

so that the expected number of such fragmentations is indeed $O(n^{2\alpha-1}(\log n)^2)$. □

Lemma 12. The number of fragmentations of small components is $o(n^{1/2})$.

Proof. By a now familiar estimate, the expected number of fragmentations that produce clusters of
size smaller than at times between n and n+t is at most 2nP~^t. So we can ignore fragmentations 
that (a) produce clusters of size smaller than n^''^^ before time cn/2 and (b) produce clusters of 
size smaller than n^-^^ at times between n and n + n^'^. 

If $c > 1$ the distribution of nongiant components in the random graph is given by the progeny of a Poisson Galton-Watson process with mean $c$ on the event of its extinction. If we let $\rho$ denote its extinction probability, then the offspring distribution conditional on extinction is given by

$$\frac{\rho^k}{\rho}\cdot e^{-c}\frac{c^k}{k!} = e^{-c\rho}\frac{(c\rho)^k}{k!}$$

since $\rho = e^{-c(1-\rho)}$. In short, PGW$(c)$ conditioned on extinction is PGW$(c\rho)$. The last observation implies that results for finite supercritical clusters can be derived from those for subcritical clusters.
In particular, by Lemma EJ the largest nongiant components seen after time n + n^'^, are smaller 
than n'^'^. Since fragmentations of such clusters necessarily produce pieces smaller than n^''^ these 
fragmentations can be ignored by (a). □ 
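The duality used in the proof of Lemma 12 can be checked numerically. The sketch below is an illustration under the stated fixed-point equation $\rho = e^{-c(1-\rho)}$: it computes the extinction probability by iteration (an implementation choice made here, not taken from the text) and compares the Poisson$(c)$ offspring law conditioned on extinction with the Poisson$(c\rho)$ law.

```python
import math

def extinction_probability(c, iterations=200):
    """Smallest root in [0, 1] of rho = exp(-c (1 - rho)), by fixed-point iteration from 0."""
    rho = 0.0
    for _ in range(iterations):
        rho = math.exp(-c * (1.0 - rho))
    return rho

def poisson_pmf(k, mean):
    """Poisson probability mass function, evaluated in logarithms."""
    return math.exp(-mean + k * math.log(mean) - math.lgamma(k + 1))

def conditioned_offspring_pmf(k, c, rho):
    """P(k children | extinction) = P(k children) * rho^k / rho for Poisson(c) offspring."""
    return poisson_pmf(k, c) * rho ** k / rho

if __name__ == "__main__":
    c = 2.0
    rho = extinction_probability(c)
    print("rho =", rho, " dual mean c*rho =", c * rho)
    for k in range(5):
        print(k, conditioned_offspring_pmf(k, c, rho), poisson_pmf(k, c * rho))
```

The two columns agree to rounding error, and the dual mean $c\rho$ is below 1, which is why the subcritical estimates apply to the nongiant components.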

6.3 The initial mass upstairs 

The last step in the proof of Theorem 3 is to ensure that upstairs never becomes empty in this process. In other words we must prove that $N^{\uparrow}_t > 0$ for all $t \ge 0$ with high probability, so that we can indeed time-change the queues by $(N^{\uparrow}_t)^{-1}$, and use rigorously all the analysis carried out on the CQS in Section 6.1. This will be done by showing that initially there are already more vertices upstairs than will ever (with high probability) be taken away by fragmentation in the cluster queuing system.

Lemma 13. Initially, upstairs contains $N^{\uparrow}_0 \ge Kn^{1-\alpha/2}$ vertices. In particular $N^{\uparrow}_0 > n^{\alpha}(\log n)^2$ and upstairs never becomes empty during the supercritical regime.

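Proof. Lemma 6 implies that when $c = 1$ the expected number of trees of size $k$ satisfies

$$ET_k \sim \frac{nk^{-5/2}}{\sqrt{2\pi}}\exp(-k^3/6n^2)$$

If we let $|C_{>\alpha}| = \sum_{k \ge n^{\alpha}} T_k$ then it follows that

$$E|C_{>\alpha}| \sim \sum_{k \ge n^{\alpha}} \frac{nk^{-5/2}}{\sqrt{2\pi}}\exp(-k^3/6n^2) \ge Kn^{1-3\alpha/2}$$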



Bollobás (1985) has calculated (see page 107) that the expected number of ordered pairs of trees of sizes $j$ and $k$ satisfies

$$E(T_jT_k) \le ET_j\,ET_k$$

When $j \ne k$ this implies $\operatorname{cov}(T_j, T_k) \le 0$, and for $j = k$ that $ET_k(T_k - 1) \le (ET_k)^2$ or $\operatorname{var}(T_k) \le ET_k$. Summing we have

$$\operatorname{var}(|C_{>\alpha}|) \le E|C_{>\alpha}|$$

and it follows from Chebyshev's inequality that $|C_{>\alpha}|/E|C_{>\alpha}| \to 1$ in probability. These trees have not experienced fragmentation so their size is always at least $n^{\alpha}$ and the total mass in large components is at least $Kn^{1-\alpha/2}$. When $\alpha < 2/3$ and $n$ is large, this is much larger than the $n^{\alpha}(\log n)^2$ upper bound on the missing mass due to fragmentations.

At this point the proof of Theorem 3 is complete. □
6.4 A sharper estimate for the mass upstairs 

In Section 6.3 above, we have just proved that upstairs never becomes empty in the supercritical regime (Lemma 13). But, as was already mentioned earlier, we expect by analogy with the random graph that in fact a positive fraction of all $n$ vertices stay upstairs. This is the content of the following theorem, which we restate here for convenience and then prove.

Theorem 13. For any $c > 1$, at time $cn/2$ there are at least $\theta(c)n - o(n)$ vertices located on large cycles (i.e., of size greater than or equal to $n^{\alpha}$, for any $\alpha < 2/3$).

Proof. In fact it is a simple consequence of Lemmas 10 and 11. Indeed, the mass missing upstairs must be a piece of the random graph's giant component that has fallen downstairs by fragmentation. Therefore either it is a medium cluster or it has experienced a further fragmentation. But we now know that there are never more than $n^{\alpha}(\log n)^2$ vertices in medium clusters by Lemma 10. On the other hand, by Lemma 11 the expected number of vertices in clusters having experienced multiple fragmentations has to be smaller than

$$n^{\alpha} \cdot Kn^{2\alpha-1}(\log n)^2 = o(n)$$
as long as $\alpha < 2/3$. □

Figure 1: Breakpoint graph for human-mouse X chromosome comparison

36 37 17 40 16
55 28 13 51 22
6 7 35 64 33
62 12 1 11 23
48 3 21 53 8
19 49 34 59 30
27 38 50 26 25
71 78 73 47 54

15 14 63 10 9
79 39 70 66 5
32 60 61 18 65
20 4 52 68 29
43 72 58 57 56
77 31 67 44 2
76 69 41 24 75
45 74 42 46

Table 1: Order of the genes in D. repleta compared to their order in D. melanogaster

Figure 2. Average values of $k - D_k$ for 10,000 simulations of the random transposition walk, 10,000 simulations of $k - d_Q(\pi_k)$ and Pevzner's 100 simulations of $k - d(\pi_k)$ for the reversal chain. The smooth curve of small triangles gives the limiting behavior as $n \to \infty$ of $cn/2 - D_{cn/2}$ from Theorem 3.

Figure 3. Comparison of the distribution of the number of fragmentations in 10,000 simulations of the random transposition chain with the Poisson distribution with the same mean.

Figure 4. Growth of the average fraction of vertices in the cycle containing 1 at time $cn/2$ in 10,000 simulations of the random transposition chain with n=100 and n=1000 compared to one-half the square of the percolation probability of the corresponding random graph with $p = c/n$.

Figure 5. Histogram of the size of the cycle containing 1 in 100,000 simulations of the random transposition chain with n=100 at time 100 (c=2). The open squares give the distribution of the size of finite clusters in the corresponding random graph.